fix: wait for cloud-init completion before connecting to new VMs #930

Merged
rysweet merged 7 commits into main from fix/issue-929-cloud-init-wait
Apr 2, 2026
@rysweet rysweet commented Apr 2, 2026

Summary

Fixes #929 — VM creation now waits for cloud-init provisioning to complete before forwarding credentials or connecting the user.

  • Increase SSH wait timeout from 120s to 300s
  • Add wait_for_cloud_init() — polls cloud-init status over SSH every 10s until done/disabled/error (600s max)
  • Add resolve_ssh_key() + base_ssh_args() — inject identity key (~/.ssh/azlin_key) into all SSH/SCP operations
  • Handle status: disabled as terminal (prevents 600s hang on non-cloud-init VMs)
  • Add ConnectTimeout=10 to ssh_output to prevent hung connections
  • Update credential-forwarding reference docs
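
The polling behavior described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the names `CloudInitState`, `parse_status`, and `wait_until_terminal` are hypothetical, and the SSH call is abstracted behind a `probe` closure so the loop can be exercised without a VM. It assumes the remote command prints a line of the form `status: <state>`, and it treats done, disabled, and error as terminal states.

```rust
use std::time::{Duration, Instant};

/// State reported by one `cloud-init status` poll (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum CloudInitState {
    Done,
    Disabled,
    Error,
    /// Any non-terminal or unrecognized state, e.g. "running".
    Pending,
}

impl CloudInitState {
    /// done, disabled, and error all end the wait.
    pub fn is_terminal(self) -> bool {
        !matches!(self, CloudInitState::Pending)
    }
}

/// Parse output such as "status: done" into a state.
pub fn parse_status(output: &str) -> CloudInitState {
    let status = output
        .lines()
        .find_map(|line| line.trim().strip_prefix("status:"))
        .map(str::trim)
        .unwrap_or("");
    match status {
        "done" => CloudInitState::Done,
        "disabled" => CloudInitState::Disabled,
        "error" => CloudInitState::Error,
        _ => CloudInitState::Pending,
    }
}

/// Call `probe` every `interval` until it reports a terminal state or
/// `timeout` elapses. `probe` returns None when the SSH call itself failed.
/// Returns None on timeout so the caller can continue on a best-effort basis.
pub fn wait_until_terminal<F>(
    mut probe: F,
    interval: Duration,
    timeout: Duration,
) -> Option<CloudInitState>
where
    F: FnMut() -> Option<String>,
{
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(output) = probe() {
            let state = parse_status(&output);
            if state.is_terminal() {
                return Some(state);
            }
        }
        // Stop if another full interval would overshoot the deadline.
        if Instant::now() + interval > deadline {
            return None;
        }
        std::thread::sleep(interval);
    }
}
```

With the values from this PR, a caller would pass a 10 s interval and a 600 s timeout. Treating `disabled` as terminal is what avoids the full 600 s wait on VMs that do not run cloud-init.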

Step 13: Local Testing Results

Test Environment: fix/issue-929-cloud-init-wait branch, release binary, 2026-04-01
Tests Executed:

  1. Simple: azlin new --name test-929-v3 --yes --no-auto-connect → SSH connected, cloud-init waited, credentials forwarded ✅
  2. Complex: Verified cloud-init status reports "done" before proceeding, identity key used for SSH/SCP ✅

Regressions: All 2180 unit tests pass, clippy clean ✅
Issues Found: Cloud-init script itself has pre-existing issues installing some tools (gh, node, az) — separate from this PR's wait-for-completion fix.

Test plan

  • Create a new VM with azlin new --name <name> and verify:
    • SSH connects successfully (identity key resolved)
    • "Waiting for cloud-init..." message appears
    • "Cloud-init provisioning complete." appears before credential forwarding
    • Credentials forwarded successfully (gh, claude, az)
  • Verify existing azlin connect to already-provisioned VMs is unaffected
  • Verify bastion tunnel path still works
  • Run cargo test --package azlin — all tests pass

🤖 Generated with Claude Code

Ryan Sweet and others added 5 commits March 27, 2026 13:19
Add all development tools that were in the Python vm_provisioning.py but
missing from the Rust cloud_init implementation:

- GitHub CLI (gh) via official apt repo
- Azure CLI via InstallAzureCLIDeb script
- Node.js 22.x via NodeSource
- Claude Code AI assistant
- Go 1.24.1
- Python 3.13 + python-is-python3
- uv package manager
- tmux configuration (status bar, socket permissions)
- Docker post-install (add user to docker group)
- npm global prefix configuration
- .bashrc PATH additions (Go, Cargo, npm)

Updates both cloud-init code paths:
- cloud_init.rs: YAML-based cloud-init config (packages + runcmd)
- vm.rs: shell-script based cloud-init provisioning

Removes broken amplihack make install (target doesn't exist).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Quality audit findings:
- default_dev_setup_commands() now takes username parameter instead of
  hardcoding 'azureuser' (HIGH: would fail for non-default usernames)
- tmux socket dir uses dynamic UID via id -u instead of hardcoded 1000
  (MEDIUM: would fail if user UID != 1000)
- Added version verification step to shell-script cloud-init path
  (matches YAML path's existing verification)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Quality audit cycle 2 findings:
- rustc --version ran as root but Rust is installed in user homedir
  (HIGH: verification always failed even when install succeeded)
- Standardize on apt-get upgrade instead of full-upgrade to match
  shell-script path and avoid unexpected package removal (MEDIUM)
- Remove unnecessary ripgrep reinstall (only needed with full-upgrade)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Quality audit cycle 3: eliminate predictable /tmp path for GitHub CLI
GPG keyring download. Download directly to /etc/apt/keyrings/ in both
cloud-init code paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

After VM creation, azlin now waits for cloud-init provisioning to finish
before forwarding credentials or auto-connecting the user. Previously,
the code only waited for SSH to become reachable (~2 min), but cloud-init
takes 5-10 min to install all tools (gh, az, node, rustc, etc.).

Changes:
- Increase SSH wait timeout from 120s to 300s
- Add wait_for_cloud_init() that polls cloud-init status over SSH every
  10s until done/disabled/error (600s timeout, best-effort)
- Add ssh_output() helper that captures remote command stdout
- Add resolve_ssh_key() + base_ssh_args() to inject identity key
  (~/.ssh/azlin_key) into all SSH/SCP operations
- Handle cloud-init "disabled" state as terminal (no 600s hang)
- Add ConnectTimeout=10 to ssh_output to prevent hung connections
- Update credential-forwarding docs with cloud-init wait behavior

Fixes #929

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
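
As a rough illustration of the identity-key handling in this commit, the sketch below shows one way a shared argument builder could look. The function names match those in the commit message, but the signatures and bodies here are assumptions, not the merged code: in particular, taking the home directory as a parameter and returning `Vec<String>` are choices made for this example only.

```rust
use std::path::{Path, PathBuf};

/// Return the azlin identity key under `home` if the file exists.
/// (Assumed signature; the real helper may locate the home directory itself.)
pub fn resolve_ssh_key(home: &Path) -> Option<PathBuf> {
    let key = home.join(".ssh").join("azlin_key");
    key.is_file().then_some(key)
}

/// Build the options shared by SSH and SCP invocations.
/// The two `-o` options are the ones the review lists for new SSH calls;
/// `-i <key>` is appended only when a key was resolved.
pub fn base_ssh_args(key: Option<&Path>) -> Vec<String> {
    let mut args = vec![
        "-o".to_string(),
        "StrictHostKeyChecking=accept-new".to_string(),
        "-o".to_string(),
        "BatchMode=yes".to_string(),
    ];
    if let Some(path) = key {
        args.push("-i".to_string());
        args.push(path.display().to_string());
    }
    args
}
```

Building the options in one place means a later change, such as a different key path, applies to every SSH and SCP call that uses the helper.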
rysweet commented Apr 2, 2026

Review Findings (Steps 10 + 16)

Reviewer Agent

  • FIXED: status: disabled now treated as terminal (prevents 600s hang)
  • FIXED: ConnectTimeout=10 added to ssh_output
  • FIXED: Identity key (~/.ssh/azlin_key) now resolved and passed to all SSH/SCP operations
  • Tests are string-matching sanity checks — appropriate for this scope

Security Agent

  • No injection risks — commands are hardcoded literals, Command::new().args() avoids shell
  • StrictHostKeyChecking=accept-new + BatchMode=yes correctly set on all new SSH calls
  • Malicious VM can fake status: done but consequence is benign (same as timeout path)
  • No blocking findings
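
To illustrate the first Security Agent point, here is a small sketch of building the status check with `std::process::Command`. The function name `cloud_init_status_command` and its parameters are hypothetical. Each argument is handed to the `ssh` process as a separate argv entry, so no local shell parses the user or host values; the remote command is a fixed literal.

```rust
use std::process::Command;

/// Build (but do not run) an ssh invocation that asks the VM for its
/// cloud-init status. No local shell is involved: every value below
/// becomes exactly one argv entry of the `ssh` process.
pub fn cloud_init_status_command(user: &str, host: &str) -> Command {
    let mut cmd = Command::new("ssh");
    cmd.args([
        "-o",
        "StrictHostKeyChecking=accept-new",
        "-o",
        "BatchMode=yes",
        "-o",
        "ConnectTimeout=10",
    ])
    .arg(format!("{user}@{host}"))
    // Hardcoded literal: nothing user-controlled reaches the remote shell.
    .arg("cloud-init status");
    cmd
}
```

A caller would run this with `.output()` and read stdout. Because `BatchMode=yes` disables interactive prompts and `ConnectTimeout=10` bounds the connection attempt, an unreachable VM produces an error instead of a hang.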

Philosophy Guardian (Zen-Architect)

  • Score: A- — Clean, purposeful change. Single responsibility, graceful degradation.
  • No forbidden pattern violations
  • Suggestion: SCP helpers still inline StrictHostKeyChecking instead of using base_ssh_args() — minor DRY concern, not blocking

All three reviews pass. No blocking issues remain.

@rysweet rysweet marked this pull request as ready for review April 2, 2026 02:08
Ryan Sweet and others added 2 commits April 1, 2026 19:20
The cloud-init script had three issues. The first two caused set -euo pipefail
to abort before installing gh, az, node, and other tools:

1. update-alternatives --set python3 python3.13 breaks apt tools because
   apt_pkg is built for system Python 3.12, causing apt-get update to fail
   with "No module named 'apt_pkg'" - fixed by installing python3.13 without
   changing the system python3 default

2. get-pip.py fails on Ubuntu 24.04 because pip 24.0 is already installed as
   a debian package - removed the pip reinstall entirely

3. Em dash (U+2014) in shell comment caused Azure CLI latin-1 encoding error
   when passing --custom-data - replaced with ASCII hyphen

Fixes #929

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
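
The third issue above suggests a simple guard that could run in a unit test: scan the generated script for non-ASCII characters before it is passed as custom data. The sketch below is an assumption about how such a check might look, not part of this PR, and `find_non_ascii` is a hypothetical name. It is stricter than latin-1 requires, since latin-1 can encode some non-ASCII characters, but an ASCII-only rule is easier to state and verify.

```rust
/// Return (line, column, character) for every non-ASCII character in
/// `script`. Lines and columns are 1-based; columns count characters,
/// not bytes.
pub fn find_non_ascii(script: &str) -> Vec<(usize, usize, char)> {
    script
        .lines()
        .enumerate()
        .flat_map(|(i, line)| {
            line.chars()
                .enumerate()
                .filter(|(_, c)| !c.is_ascii())
                .map(move |(j, c)| (i + 1, j + 1, c))
        })
        .collect()
}
```

A test over the cloud-init template could then assert that the returned list is empty, which would flag a stray U+2014 at its exact line and column.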
@rysweet rysweet merged commit db04cfd into main Apr 2, 2026
12 checks passed
@rysweet rysweet deleted the fix/issue-929-cloud-init-wait branch April 2, 2026 04:19
Linked issue

fix: VM creation does not wait for cloud-init completion — tools missing on connect